Conversation
There was a problem hiding this comment.
Pull request overview
Adds seek support for Avro container readers by tracking block boundaries (offset + record count) during iteration and exposing an API to seek back to a previously-seen block for Read + Seek inputs.
Changes:
- Introduced
BlockPositionand internal position tracking to record block start offsets as blocks are loaded. - Added
Reader::{current_block,data_start,seek_to_block}(seek API gated onSeek) plus tests validating seeking between blocks. - Added a new error detail variant for seek failures.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
avro/src/reader/mod.rs |
Exposes BlockPosition and adds reader-level block position + seek APIs with new tests. |
avro/src/reader/block.rs |
Implements BlockPosition, PositionTracker, and block-level seek + block-boundary tracking. |
avro/src/lib.rs |
Re-exports BlockPosition from the crate root. |
avro/src/error.rs |
Adds Details::SeekToBlock for I/O seek errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Do you have a use case for this functionality ? |
| let n = self.inner.read(buf)?; | ||
| self.pos += n as u64; | ||
| Ok(n) | ||
| } |
There was a problem hiding this comment.
The implementation assumes that read_buf(), read_exact(), read_vectored(), ... use their default impls and delegate to read(). But if they are overwritten then the tracking breaks.
There was a problem hiding this comment.
Is there any common examples for this behavior?
There was a problem hiding this comment.
The tracking is done by wrapping the reader in a PositionTracker right? So it will always go through PositionTracker::read.
|
|
||
| self.current_block_info = Some(BlockPosition { | ||
| offset: block_start, | ||
| message_count: block_len as usize, |
There was a problem hiding this comment.
| message_count: block_len as usize, | |
| message_count: self.message_count, |
There was a problem hiding this comment.
I actually wrote this part specifically this way to ensure it would not lost, if a refactoring or other change are applied. It cannot rely on the meaning of the self.message_count and its current value if those two will be separated into different places of code
There was a problem hiding this comment.
You can move self.message_count = block_len as usize down next to self.current_block_info to reduce that risk. As I do think Martin's suggestion is reasonable.
There was a problem hiding this comment.
I'm not arguing, any of this would work and I can change, that's no problem. But I'm curios to understand why you think this would be better?
My idea is just using the source of truth variable, so there is no way it gets a wrong number. Using the self.message_count works with the current code, but it doesn't give any guarantee if it changes. So I'm wondering why the second approach is better?
I need to read just a few records from a large Avro file, and without this it's incredibly inefficient as I need to read the whole file from the start each time. |
Added a feature to seek to a particular block when reading an Avro file.
The Reader now provides the current Block position and for a Seek'able input it can seek to the specific position, assuming it's a valid start position of a block.